Key Issues in Vowel Based Splitting of Telugu Bigrams
Abstract
Splitting compound Telugu words into their components or root words is one of the important, tedious and yet inaccurate tasks of Natural Language Processing (NLP). Except in a few special cases, at least one vowel is necessarily involved in Telugu conjunctions. As a result, vowels are often repeated as they are, or are converted into other vowels or consonants. This paper describes the issues involved in vowel-based splitting of a Telugu bigram into proper root words using Telugu grammar conjunction (‘sandhi’) rules for machine translation (MT).
Keywords: Telugu word splitting; vowel based splitting; compound word splitting; bigrams; trigrams; n-grams; NLP
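As a rough illustration of what vowel-based splitting with sandhi rules can look like, the following minimal Python sketch reverses a few vowel conjunctions. It is not the paper's method. The rule table, the romanization (capital letters marking long vowels), the function name candidate_splits and the toy lexicon are all assumptions introduced here for illustration.

```python
# Hypothetical rule table (an assumption, not taken from the paper):
# surface vowel at the junction -> possible pairs of
# (final vowel of first word, initial vowel of second word).
# Romanization assumption: capital letters denote long vowels.
SANDHI_RULES = {
    "A": [("a", "a"), ("a", "A")],  # a + a -> A, a + A -> A
    "E": [("a", "i")],              # a + i -> E
    "O": [("a", "u")],              # a + u -> O
}


def candidate_splits(compound, lexicon):
    """Return (left, right) word pairs that could have merged into `compound`.

    Each position holding a junction vowel is tried as the split point,
    the rule is undone, and a pair is kept only if both parts appear in
    the (hypothetical) lexicon of known root words.
    """
    results = []
    for pos, ch in enumerate(compound):
        for left_vowel, right_vowel in SANDHI_RULES.get(ch, []):
            left = compound[:pos] + left_vowel
            right = right_vowel + compound[pos + 1:]
            if left in lexicon and right in lexicon:
                results.append((left, right))
    return results


# Toy lexicon, also an assumption for this example.
lexicon = {"rAma", "Alayam", "rAja", "indruDu"}

print(candidate_splits("rAmAlayam", lexicon))   # [('rAma', 'Alayam')]
print(candidate_splits("rAjEndruDu", lexicon))  # [('rAja', 'indruDu')]
```

The lexicon check is one simple way to prune wrong splits: a long vowel inside a root word (the first "A" in "rAmAlayam") also matches a rule, which hints at why vowel-based splitting is error-prone without further constraints.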
Similar resources
Telugu Bigram Splitting using Consonant-based and Phrase-based Splitting
Splitting is a conventional process in most Indian languages, following their grammar rules. It is called ‘pada vicchEdanam’ (a Sanskrit term for word splitting) and is widely used in most Indian languages. Splitting plays a key role in Machine Translation (MT), particularly when the source language (SL) is an Indian language. Though this splitting may not succeed completely in extra...
Wordform- and Class-based Prediction of the Components of German Nominal Compounds in an AAC System
In word prediction systems for augmentative and alternative communication (AAC), productive word-formation processes such as compounding pose a serious problem. We present a model that predicts German nominal compounds by splitting them into their modifier and head components, instead of trying to predict them as a whole. The model is improved further by the use of class-based modifier-head bigra...
Online Recognition of Handwritten Telugu Characters
A system for online recognition of handwritten Telugu script is presented. A handwritten character is constructed by executing a sequence of strokes. A structure- or shape-based representation of a stroke is used, in which a stroke is represented as a string of shape features. Using this string representation, an unknown stroke is identified by comparing it with a database of strokes. A full chara...
Vowel Identification Using Piecewise Separation Technique
A simple method for computer recognition of Telugu speech sounds irrespective of speakers is described. A vocabulary consisting of 871 Telugu words containing the ten vowels (/ə/, /a:/, /i/, /i:/, /u/, /u:/, /e/, /e:/, /o/ and /o:/) in consonant-vowel nucleus-consonant (CNC) combination and uttered by three informants was selected as the testing material. Formant frequencies F1, F2 and F3 of...
Classification and Identification of Telugu Aksharas using Moment Invariants and C4.5 Algorithm
Classifying and recognizing Telugu characters (aksharas) is a challenging task because of the variations in the script and the large number of characters. The complexity of the shape is a result of structural compositions involving vowels (V), consonants (C), consonants with vowel modifiers (CV) and consonant clusters (CCV). This paper presents a novel classification strategy for classifying ak...